CD8 TIL projection uncover cells types hidden by transient gene programs
Cell-Cycle, interferon and Tissue-residency gene programs
Background
Cells responding to external cues might overexpress transient gene programs. In scRNA-seq data, clusters will be driven by the strongest variation component, which can be sometimes transient gene programs. These programs such as the cell cycle, might hide the original, potentially more stable cell type beneath.
These are among the main types of program that classically are being upregulated in scRNA-seq data:
Cell cycle, when cells go through division (eg. MCM5, TOP2A)
IFN response, in response to IFN stimulation (eg. MX1, ISG15)
HSP response, in response to stress (eg. HSPA1, HSPA2)
Activation response, in response to activation signal, eg TCR engagement (eg. JUN, FOS)
Tissue-residency program, when cell infiltrate and home within tissues (eg. ZNF683, ITGAE)
Note: while some programs are clearly transient (like the cell cycle), some programs might be more permanent / epigenetically fixed. The distinction between transient and permanent might be harder for some programs, like tissue-residency. We argue here than having a well-curated reference, free from many clearly transient gene programs, will help in distinguishing transient from more stable cell states.
Here we will see how we use reference-based projection to uncover the original cell types from IFN-activated, Cycling and Tissue-resident cells using data from Gueguen et al. 2021.
How to identify gene programs?
Coregulated gene sets can also be found using unsupervised
techniques. Gene modules can be found using PCA, ICA, cNMF, scCoGAPS,
or find_gene_modules from Monocle3,
which runs UMAP on the genes and then groups them into modules using
Louvain clustering. Here, for simplicity, we chose to focus on easily
explainable, explicit programs.
Data setup
Loading the CD8 reference
First let’s load the CD8 TIL reference by downloading it from Figshare.
# Load the reference
options(timeout = max(900, getOption("timeout")))
#download.file("https://figshare.com/ndownloader/files/38921366", destfile = "CD8T_human_ref_v1.rds")
ref.cd8 <- load.reference.map("CD8T_human_ref_v1.rds")## [1] "Loading Custom Reference Atlas..."
## [1] "Loaded Custom Reference map Human CD8 TILs"
# Setup colors
mycols <- ref.cd8@misc$atlas.palette
# Plotting
DimPlot(ref.cd8, group.by = 'functional.cluster', label = T, repel = T, cols = mycols) + theme(aspect.ratio = 1)We see that the reference is composed only of what should only be stable cell types. Indeed, cells expressing highly transient gene programs (Cycling and IFN) were removed when constructing the reference.
Load Gueguen et al. data
#download.file("https://figshare.com/ndownloader/files/39082049", destfile = "gueguen.cd3.Rds")
gueguen.cd3 <- readRDS("gueguen.cd3.Rds")
gueguen.cd3$seurat_clusters <- Idents(gueguen.cd3)
# DimPlot
DimPlot(gueguen.cd3, order = T, label = T, repel = T) Projection using reference map
The first step is to project using ProjecTILs
ProjecTILs.classifier function. This mode will give us the
labels, while keeping the original UMAP embeddings.
# Projection
DefaultAssay(gueguen.cd3) <- "RNA"
gueguen.cd3 <- ProjecTILs.classifier(gueguen.cd3, ref = ref.cd8, filter.cells = T, split.by = 'orig.ident', ncores = 6)
table(gueguen.cd3$functional.cluster)##
## CD8.CM CD8.EM CD8.MAIT CD8.NaiveLike CD8.TEMRA
## 2132 3181 147 238 276
## CD8.TEX CD8.TPEX
## 3340 337
# Plotting
DimPlot(gueguen.cd3, group.by = 'functional.cluster', order = T, cols = mycols, label = T, repel = T)Augmenting cell type classifications with gene program scores
To further augment granularity, here we divide the dataset by computing signatures scores using UCell. To this end, we use our collection of useful transcriptional gene signatures: SignatuR, as well as manually curated signatures.
Here, we will use signatures generated for:
G1S cell cycle
G2M cell cycle
IFN response
Tissue residency
library(SignatuR)
signatures <- GetSignature(SignatuR$Hs)
signatures <- signatures[c("cellCycle.G1S","cellCycle.G2M")]
# Using manual signatures for IFN and resident, which work better with UCell than the full signatures from SignatuR package
signatures[["IFN"]] <- c("ISG15","IFI6","IFI44L","MX1")
signatures[["Tissue_Resident"]] <- c("ITGAE")Now let’s add the scores using UCell.
gueguen.cd3 <- AddModuleScore_UCell(gueguen.cd3, features = signatures, ncores = 8,
assay = "RNA")Should I use unsupervised clusters or UCell scores to measure gene programs?
It all depends on the dataset that you have. If you have annotated your dataset with ProjecTILs, you won’t have any cluster driven by transient programs. In this case, you can use UCell signature scoring. Else, if you have clusters clearly driven by transient gene programs, you can use ProjecTILs mapping to unravel the underlying cell types within these clusters of interest.
FeaturePlot(gueguen.cd3, features = paste0(names(signatures), "_UCell"),
order = T, pt.size = 0.3) & scale_color_paletteer_c("pals::coolwarm") ## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
We see that these signatures correlate with their annotation of the original paper.
Now, let’s check signatures scores distribution to decide which threshold put to distinguish negative/positive cells.
Signatures scores distributions
qplot(gueguen.cd3$cellCycle.G1S_UCell, main = "G1S-Cycling") +
qplot(gueguen.cd3$cellCycle.G2M_UCell, main = "G2M-Cycling") + qplot(gueguen.cd3$IFN_UCell, main = "IFN") + qplot(gueguen.cd3$Tissue_Resident_UCell, main = "Tissue Resident") & theme_bw()# subset resident high cells in the CD8 reference
resident.high <- subset(gueguen.cd3, subset = Tissue_Resident_UCell > 0.2)
resident.low <- subset(gueguen.cd3, subset = Tissue_Resident_UCell <= 0.2)
# Set list
resident.list <- list(resident.low,resident.high)
names(resident.list) <- c("resident.low","resident.high")# Radars to check differences between resident / non resident
plot.states.radar(ref = ref.cd8, query = resident.list, labels.col = 'functional.cluster', min.cells = 5, genes4radar = c("TCF7","CCR7","KLF2","GZMK","LMNA","FCGR3A","FGFBP2","XCL1","KLRB1", 'ZNF683',"ITGAE", 'ITGA1')) Resident-high and resident-low cells display the same overall phenotypes, but differ in their expression of key resident genes, upregulated when cells home whithin tissue, such as integrins (ITGAE, ITGA1) or the transcription factor HOBIT (ZNF683).
Indeed, the radars confirm that the annotation seems of good quality. Radar are good way to show that phenotypes match between the reference and the query, but they differ by the resident genes (ZNF683, ITGA1, ITGAE and CXCR6). We also confirm that the cluster CD8-XCL1, originally identified as an early-precursors, indeed match well the phenotype of Central Memory cells, while still being high for tissue-residency signal.
By looking at the UCell scores distribution, we categorised the cells by putting a UCell score threshold of 0.2.
wrap_plots(pll, guides = "collect")As previously shown, CD8.TEX and CD8.TPEX are the most cycling,
resident and IFN-responding subsets. For the IFN signal, results are
inconsistent between UCell signature scores and projected
clustes. Results will vary depending on the signatures chosen, the UCell
threshold chosen, as well as the clustering resolution.
Upset plot to check programs cooccurences
Let’s focus on the CD8.TEX population, and see how the different gene programs relate using a UpSet plot.
gueguen.cd3.TEX <- subset(gueguen.cd3, subset = functional.cluster == "CD8.TEX")
listInput <- list(IFN = which(gueguen.cd3.TEX$IFN.class == "Positive"),
Cycling.G2M = which(gueguen.cd3.TEX$cellCycle.G2M.class == "Positive"),
Cycling.G1S = which(gueguen.cd3.TEX$cellCycle.G1S.class == "Positive"),
Tissue_resident = which(gueguen.cd3.TEX$Tissue_Resident.class == "Positive"))
upset(fromList(listInput), order.by = "freq")
grid.text("CD8.TEX gene programs overlap",x = 0.65, y=0.95, gp=gpar(fontsize=12))We can see that in CD8.TEX, broadly half of the IFN high cells are also high for the tissue-residency program. This should be explored further.
Comparison with the original annotation
In the original study, the authors use Seurat
TransferData function to tranfer labels from the
non-cycling clusters to the cycling clusters (‘CD4/8-TOP2A’,
‘CD4/8-MCM5’). With label tranfer, Cycling cells are defined as CD8.LAYN
and CD8.GZMH.
Now let’s compare this result with our projection results. First, let’s select the 2 cycling clusters.
gueguen.cycling <- subset(gueguen.cd3, subset = seurat_clusters %in% c('CD4/8-TOP2A', "CD4/8-MCM5"))
# Dimplot
DimPlot(gueguen.cycling, group.by = 'functional.cluster', label = T, repel = T, cols = mycols)## Warning: ggrepel: 1 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
Clusters proportions in cycling
plot.statepred.composition(ref.cd8, gueguen.cycling, metric = "Percent") +
ggtitle("Cycling cells") + ylim(0, 75) + theme_bw() + RotatedAxis()Let’s check radars plots to confirm identities.
plot.states.radar(ref = ref.cd8, query = gueguen.cycling, labels.col = 'functional.cluster', min.cells = 5, genes4radar = c('IL7R','LEF1', "TCF7","CCR7","KLF2","GZMK","FCGR3A","FGFBP2","XCL1","KLRB1",'TOP2A','MCM5','MCM3'))Radar plots look good. It seems that we have indeed less differentiated cells in the cycling compartment (Central Memory and effector memory). Projection seems more robust than Seurat label transfer to uncover cell types hidden behind transient gene programs.
Session Info
sessionInfo()## R version 4.2.1 (2022-06-23)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Ventura 13.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] grid stats graphics grDevices datasets utils methods
## [8] base
##
## other attached packages:
## [1] SignatuR_0.1.0 UpSetR_1.4.0 cowplot_1.1.1
## [4] plotly_4.10.1 pheatmap_1.0.12 UCell_2.2.0
## [7] paletteer_1.5.0 scales_1.2.1 ggrepel_0.9.2
## [10] gridExtra_2.3 ProjecTILs_3.0.2 patchwork_1.1.2
## [13] GEOquery_2.66.0 Biobase_2.58.0 BiocGenerics_0.44.0
## [16] data.table_1.14.6 STACAS_2.0.0 scGate_1.4.1
## [19] forcats_0.5.2 stringr_1.5.0 dplyr_1.0.10
## [22] purrr_1.0.1 readr_2.1.3 tidyr_1.2.1
## [25] tibble_3.1.8 ggplot2_3.4.0 tidyverse_1.3.2
## [28] SeuratObject_4.1.3 Seurat_4.3.0
##
## loaded via a namespace (and not attached):
## [1] utf8_1.2.2 spatstat.explore_3.0-5
## [3] reticulate_1.27 tidyselect_1.2.0
## [5] htmlwidgets_1.6.1 BiocParallel_1.32.5
## [7] Rtsne_0.16 munsell_0.5.0
## [9] codetools_0.2-18 ica_1.0-3
## [11] umap_0.2.9.0 future_1.30.0
## [13] miniUI_0.1.1.1 withr_2.5.0
## [15] spatstat.random_3.1-2 colorspace_2.1-0
## [17] progressr_0.13.0 highr_0.10
## [19] knitr_1.41 rstudioapi_0.14
## [21] stats4_4.2.1 SingleCellExperiment_1.20.0
## [23] ROCR_1.0-11 tensor_1.5
## [25] listenv_0.9.0 labeling_0.4.2
## [27] MatrixGenerics_1.10.0 GenomeInfoDbData_1.2.9
## [29] polyclip_1.10-4 farver_2.1.1
## [31] parallelly_1.34.0 vctrs_0.5.2
## [33] generics_0.1.3 xfun_0.36
## [35] timechange_0.2.0 R6_2.5.1
## [37] GenomeInfoDb_1.34.7 rmdformats_1.0.4
## [39] pals_1.7 bitops_1.0-7
## [41] spatstat.utils_3.0-1 cachem_1.0.6
## [43] DelayedArray_0.24.0 assertthat_0.2.1
## [45] promises_1.2.0.1 googlesheets4_1.0.1
## [47] gtable_0.3.1 globals_0.16.2
## [49] goftest_1.2-3 rlang_1.0.6
## [51] splines_4.2.1 lazyeval_0.2.2
## [53] gargle_1.2.1 dichromat_2.0-0.1
## [55] prismatic_1.1.1 spatstat.geom_3.0-5
## [57] broom_1.0.2 BiocManager_1.30.19
## [59] yaml_2.3.7 reshape2_1.4.4
## [61] abind_1.4-5 modelr_0.1.10
## [63] backports_1.4.1 httpuv_1.6.8
## [65] tools_4.2.1 bookdown_0.32
## [67] ellipsis_0.3.2 jquerylib_0.1.4
## [69] RColorBrewer_1.1-3 ggridges_0.5.4
## [71] Rcpp_1.0.10 plyr_1.8.8
## [73] zlibbioc_1.44.0 RCurl_1.98-1.9
## [75] openssl_2.0.5 deldir_1.0-6
## [77] pbapply_1.7-0 S4Vectors_0.36.1
## [79] zoo_1.8-11 SummarizedExperiment_1.28.0
## [81] haven_2.5.1 cluster_2.1.4
## [83] fs_1.6.0 magrittr_2.0.3
## [85] RSpectra_0.16-1 scattermore_0.8
## [87] lmtest_0.9-40 reprex_2.0.2
## [89] RANN_2.6.1 googledrive_2.0.0
## [91] fitdistrplus_1.1-8 matrixStats_0.63.0
## [93] hms_1.1.2 mime_0.12
## [95] evaluate_0.20 xtable_1.8-4
## [97] readxl_1.4.1 IRanges_2.32.0
## [99] compiler_4.2.1 maps_3.4.1
## [101] KernSmooth_2.23-20 crayon_1.5.2
## [103] htmltools_0.5.4 later_1.3.0
## [105] tzdb_0.3.0 lubridate_1.9.0
## [107] DBI_1.1.3 dbplyr_2.3.0
## [109] MASS_7.3-58.2 data.tree_1.0.0
## [111] Matrix_1.5-3 cli_3.6.0
## [113] parallel_4.2.1 igraph_1.3.5
## [115] GenomicRanges_1.50.2 pkgconfig_2.0.3
## [117] sp_1.6-0 spatstat.sparse_3.0-0
## [119] xml2_1.3.3 bslib_0.4.2
## [121] XVector_0.38.0 rvest_1.0.3
## [123] digest_0.6.31 pracma_2.4.2
## [125] sctransform_0.3.5 RcppAnnoy_0.0.20
## [127] spatstat.data_3.0-0 rmarkdown_2.20
## [129] cellranger_1.1.0 leiden_0.4.3
## [131] uwot_0.1.14 shiny_1.7.4
## [133] lifecycle_1.0.3 nlme_3.1-161
## [135] jsonlite_1.8.4 BiocNeighbors_1.16.0
## [137] mapproj_1.2.9 askpass_1.1
## [139] viridisLite_0.4.1 limma_3.54.0
## [141] fansi_1.0.4 pillar_1.8.1
## [143] lattice_0.20-45 fastmap_1.1.0
## [145] httr_1.4.4 survival_3.5-0
## [147] glue_1.6.2 png_0.1-8
## [149] stringi_1.7.12 sass_0.4.5
## [151] rematch2_2.1.2 renv_0.15.5
## [153] irlba_2.3.5.1 future.apply_1.10.0